
ENH: add return_inverse to duplicated for DataFrame/Series/Index/MultiIndex #21645

Closed
wants to merge 11 commits

Conversation

h-vetinari
Contributor

Closes #21357

I followed the recommendation of @jreback there to change duplicated, and to make the output a tuple if return_inverse=True.

I didn't find a test for the behavior of DataFrame.duplicated() -- there's a test of Index.duplicated() and Series.duplicated() in tests/test_base.py, but nothing I could find in tests/frame/.

So I added a test file covering the current behavior as well as the new kwarg. Notably, the two Series in the output tuple are computed by different backends (pandas._libs.hashtable.duplicated_int64 vs. np.unique), so I also added several tests that these results are exactly compatible.
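A sketch of the kind of cross-check those tests perform (hypothetical example, not the PR's actual test code): duplicated() marks everything except the first occurrence, so a mask derived from np.unique's return_index must match it exactly.

import numpy as np
import pandas as pd

ser = pd.Series(['a', 'b', 'a', 'c', 'b'])
result = ser.duplicated(keep='first').values        # hashtable-based path
# np.unique-based path: everything except first occurrences is a duplicate
_, first_occurrence = np.unique(ser.values, return_index=True)
expected = np.ones(len(ser), dtype=bool)
expected[first_occurrence] = False
np.testing.assert_array_equal(result, expected)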

@jschendel
Member

jschendel commented Jun 26, 2018

Can you add a whatsnew entry to 0.24.0?

Also, looks like we have asvs for DataFrame.duplicated. Might be nice to add methods for return_inverse=True to these. Not sure if this is strictly necessary though.

import numpy as np
from pandas import DataFrame, date_range


class Duplicated(object):

    goal_time = 0.2

    def setup(self):
        n = (1 << 20)
        t = date_range('2015-01-01', freq='S', periods=(n // 64))
        xs = np.random.randn(n // 64).round(2)
        self.df = DataFrame({'a': np.random.randint(-1 << 8, 1 << 8, n),
                             'b': np.random.choice(t, n),
                             'c': np.random.choice(xs, n)})
        self.df2 = DataFrame(np.random.randn(1000, 100).astype(str)).T

    def time_frame_duplicated(self):
        self.df.duplicated()

    def time_frame_duplicated_wide(self):
        self.df2.duplicated()

compatible with ``return_inverse``.
return_inverse : boolean, default False
    Determines whether the mapping from unique elements to the original
    index should be returned. If True, the output is a tuple.
Member

Can you add a .. versionadded:: 0.24.0 to this?

import pandas.util.testing as tm


class TestDataFrameDuplicated(object):
Member

@jschendel jschendel Jun 26, 2018

I think these tests should be in tests/frame/test_analytics.py. It looks like there are a couple DataFrame.duplicated tests related to bugs there, though nothing related to basic behavior, e.g.

@pytest.mark.slow
def test_duplicated_do_not_fail_on_wide_dataframes(self):
    # gh-21524
    # Given the wide dataframe with a lot of columns
    # with different (important!) values
    data = {'col_{0:02d}'.format(i): np.random.randint(0, 1000, 30000)
            for i in range(100)}
    df = pd.DataFrame(data).T
    result = df.duplicated()
    # Then duplicates produce the bool pd.Series as a result
    # and don't fail during calculation.
    # Actual values don't matter here, though usually
    # it's all False in this case
    assert isinstance(result, pd.Series)
    assert result.dtype == np.bool

The Series.duplicated tests are also in the associated tests/series/tests_analytics.py file, so makes sense to mirror that setup.


# keep = 'first'
exp = Series([False, False, True, False, True])
assert_series_equal(df.duplicated(keep='first'), exp)
Member

My preference would be to use a slightly more explicit pattern throughout these tests along the lines of:

expected = ...
result = ...
assert_series_equal(expected, result)

@jschendel jschendel added this to the 0.24.0 milestone Jun 26, 2018
@codecov

codecov bot commented Jun 27, 2018

Codecov Report

Merging #21645 into master will decrease coverage by 0.14%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master   #21645      +/-   ##
==========================================
- Coverage   92.18%   92.04%   -0.15%     
==========================================
  Files         169      169              
  Lines       50820    50818       -2     
==========================================
- Hits        46850    46776      -74     
- Misses       3970     4042      +72
Flag Coverage Δ
#multiple 90.45% <100%> (-0.15%) ⬇️
#single 42.19% <13.2%> (-0.19%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexes/multi.py 95.41% <100%> (-0.04%) ⬇️
pandas/core/series.py 93.73% <100%> (-0.02%) ⬇️
pandas/core/base.py 96.99% <100%> (-0.62%) ⬇️
pandas/core/frame.py 97.21% <100%> (+0.01%) ⬆️
pandas/core/algorithms.py 94.8% <100%> (+0.11%) ⬆️
pandas/core/indexes/category.py 97.28% <100%> (+0.01%) ⬆️
pandas/core/indexes/base.py 96.45% <100%> (+0.05%) ⬆️
pandas/io/formats/console.py 65.15% <0%> (-10.61%) ⬇️
pandas/errors/__init__.py 92.3% <0%> (-7.7%) ⬇️
pandas/core/dtypes/base.py 92.68% <0%> (-7.32%) ⬇️
... and 55 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Member

@WillAyd WillAyd left a comment

Thanks for the PR


class TestDataFrameDuplicated(object):

def test_duplicated_keep(self):
Member

Can you parametrize this test?

df = DataFrame({'A': [0, 1, 1, 2, 0], 'B': ['a', 'b', 'b', 'c', 'a']})

# keep = 'first'
exp = Series([False, False, True, False, True])
Member

As a pseudo-standard, it would be preferable to name this expected instead of exp.


# keep = 'first'
exp = Series([False, False, True, False, True])
assert_series_equal(df.duplicated(keep='first'), exp)
Member

Along the same lines, do result = df.duplicated(...) on a previous line, then simply assert_series_equal(result, expected).

assert_series_equal(df.duplicated(keep=False), exp)

def test_duplicated_nan_none(self):
# np.nan and None are considered equal
Member

Hmm is this expected?

Contributor Author

This is how it works currently - stuff like this becomes visible when writing tests. ;-)

Of course someone can open a separate issue on this.

Contributor Author

Opened #21720

assert_frame_equal(reconstr, df)

# keep = False
rgx = 'The parameters return_inverse=True and keep=False cannot be.*'
Member

Make this a separate test called test_duplicated_inverse_raises


Returns
-------
duplicated : Series
duplicated : Series or tuple of Series if return_inverse is True
Member

What's the design consideration in going (Series, Series) instead of, say, (Series, ndarray)? Wondering if the latter wouldn't be preferable, since AFAICT it would mostly be used for indexing operations.

Contributor Author

It's absolutely essential that it's a Series (otherwise it'd have to be two arrays at least), because - unlike in numpy - we also need to keep track of the original index.

So, the second Series here links two arrays - which original index gets mapped to which index in the deduplicated one. This is how the reconstruction can succeed:

df == unique.reindex(inv.values).set_index(inv.index)
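A minimal runnable version of that identity, emulating the proposed inverse with np.unique on a single-column frame (the PR itself factorizes rows into integer ids first; the data here is made up):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 1, 2, 0]}, index=[1, 4, 9, 16, 25])
isdup = df.duplicated()                # keep='first'
# o2u: positions of first occurrences; u2o: broadcast back to full length
_, o2u, u2o = np.unique(df['A'].values, return_index=True, return_inverse=True)
inv = pd.Series(df.index[o2u][u2o], index=df.index)
unique = df.loc[~isdup]                # same as df.drop_duplicates()
assert unique.reindex(inv.values).set_index(inv.index).equals(df)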

Contributor Author

I added examples to the docstrings that illustrate the usage (as well as the need to keep track of the index of the deduplicated object in case of Series/DataFrame).

Member

The index is still available in the first output, so it is not that essential that the second output is a Series (there is still a design issue for sure)

ids = ids[::-1]  # np.unique takes first occurrence as unique value
_, o2u, u2o = np.unique(ids, return_inverse=True,
                        return_index=True)
inv = Series(self.index[::-1][o2u][u2o][::-1], index=self.index)
Member

Can you express this in some other fashion or at the very least comment it? The double negative-step ([::-1]) logic is tough to understand, at least at first glance.
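For reference, the logic in question works roughly like this standalone sketch (made-up data; o2u/u2o as in the diff above). np.unique always reports the first occurrence, so the keep='last' branch reverses the array before the call and reverses every positional result back afterwards:

import numpy as np
import pandas as pd

ids = np.array([0, 1, 0, 2, 1])       # factorized row ids
index = pd.Index([1, 4, 9, 16, 25])   # original index labels

reversed_ids = ids[::-1]              # first occurrence here == last occurrence in ids
_, o2u, u2o = np.unique(reversed_ids, return_index=True, return_inverse=True)
# index[::-1][o2u]: labels of the last occurrences (still in reversed order)
# [u2o]:            broadcast back to full length (still reversed)
# [::-1]:           undo the reversal to restore the original row order
inv = pd.Series(index[::-1][o2u][u2o][::-1], index=index)
# inv now maps each label to the label of the *last* occurrence of its value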

@h-vetinari
Contributor Author

@jschendel @WillAyd
Thanks for the feedback, working to incorporate all the comments.

Another question that I realised while writing the whatsnew entry - should this kwarg be made available to Series/Index as well? In this case I'm guessing I'd have to adapt core.base.IndexOpsMixin.duplicated?

@h-vetinari
Contributor Author

@jschendel Re:

Also, looks like we have asvs for DataFrame.duplicated. Might be nice to add methods for return_inverse=True to these. Not sure if this is strictly necessary though.

Haven't ever worked with the asv's yet - will have a look later.

@h-vetinari
Contributor Author

h-vetinari commented Jun 27, 2018

I added three runs in the ASV; here are the results against current master. I don't know why frame_methods.Duplicated.time_frame_duplicated failed on master - that shouldn't be the case; all the others are explained by the missing return_inverse kwarg on master.

Edit: ran again on a clean, restarted machine with no background applications.

  before           after         ratio
[a620e725]       [094139bf]
       203ms            203ms     1.00  frame_methods.Duplicated.time_frame_duplicated
-     78.1ms           70.3ms     0.90  frame_methods.Duplicated.time_frame_duplicated_mixed
      failed            156ms      n/a  frame_methods.Duplicated.time_frame_duplicated_mixed_inverse
      failed            125ms      n/a  frame_methods.Duplicated.time_frame_duplicated_mixed_inverse_last
       203ms            203ms     1.00  frame_methods.Duplicated.time_frame_duplicated_wide
      failed            203ms      n/a  frame_methods.Duplicated.time_frame_duplicated_wide_inverse

@pep8speaks

pep8speaks commented Jun 27, 2018

Hello @h-vetinari! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on August 31, 2018 at 08:00 Hours UTC

# will take fastpath for no duplicates
self.df2.duplicated(return_inverse=True)

def time_frame_duplicated_mixed(self):
Member

Instead of the 3 separate benchmarks below, you can parameterize over the return_inverse and keep keyword arguments.

Here's an example: https://github.com/h-vetinari/pandas/blob/094139bffa07c101882bbfbf35a1621eadc00adc/asv_bench/benchmarks/frame_methods.py#L219-L231
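For readers unfamiliar with asv, the parameterization pattern looks roughly like this (a sketch assuming the PR's return_inverse kwarg; params/param_names are asv's mechanism for running each benchmark once per combination):

import numpy as np
from pandas import DataFrame

class Duplicated(object):

    params = (['first', 'last', False], [True, False])
    param_names = ['keep', 'return_inverse']

    def setup(self, keep, return_inverse):
        if keep is False and return_inverse:
            # the PR raises for this combination; NotImplementedError
            # tells asv to skip it
            raise NotImplementedError
        n = 1 << 20
        self.df = DataFrame({'a': np.random.randint(-1 << 8, 1 << 8, n)})

    def time_frame_duplicated(self, keep, return_inverse):
        # return_inverse only exists on this PR's branch
        self.df.duplicated(keep=keep, return_inverse=return_inverse)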

Contributor Author

Thanks for the hint, coming in the commit for the next feedback round.

Contributor Author

Is it normal that the times are then not split up depending on the parameter?

I get:

[...]
[ 16.67%] ··· Running frame_methods.Duplicated.time_frame_duplicated         469ms;...
[ 33.33%] ··· Running frame_methods.Duplicated.time_frame_duplicated_mixed   125ms;...
[ 50.00%] ··· Running frame_methods.Duplicated.time_frame_duplicated_wide    203ms;...
[...]

Member

I believe the semicolons are splitting the time depending on the parameter. The format usually looks better with asv continuous.

Contributor Author

@mroeschke The output I posted is an extract from asv continuous, but I hadn't looked into asv compare yet, where the results were hiding. :)

@h-vetinari
Contributor Author

Test failure in circleci is unrelated. Failed in TestPythonParser.test_dtype_with_converter with

E           AssertionError: Caused unexpected warning(s): ['ResourceWarning'].

@h-vetinari
Contributor Author

@mroeschke
I got the results now, although I had to add a temporary if-condition (otherwise master would stumble over the unknown kwarg). Will remove that condition again, but I just wanted to have the same hash show up in the comparison as well as the commit I just pushed. Here are the results:

   before      after       ratio
 [a620e725]  [36b03a1c]
      203ms       203ms     1.00  frame_methods.Duplicated.time_frame_duplicated('first', False)
     failed       578ms      n/a  frame_methods.Duplicated.time_frame_duplicated('first', True)
+     203ms       250ms     1.23  frame_methods.Duplicated.time_frame_duplicated('last', False)
     failed       438ms      n/a  frame_methods.Duplicated.time_frame_duplicated('last', True)
     78.1ms      78.1ms     1.00  frame_methods.Duplicated.time_frame_duplicated_mixed('first', False)
     failed       125ms      n/a  frame_methods.Duplicated.time_frame_duplicated_mixed('first', True)
     78.1ms      78.1ms     1.00  frame_methods.Duplicated.time_frame_duplicated_mixed('last', False)
     failed       125ms      n/a  frame_methods.Duplicated.time_frame_duplicated_mixed('last', True)
      219ms       203ms     0.93  frame_methods.Duplicated.time_frame_duplicated_wide('first', False)
     failed       203ms      n/a  frame_methods.Duplicated.time_frame_duplicated_wide('first', True)
      203ms       203ms     1.00  frame_methods.Duplicated.time_frame_duplicated_wide('last', False)
     failed       203ms      n/a  frame_methods.Duplicated.time_frame_duplicated_wide('last', True)

To me, the benchmarks seem a bit unstable; otherwise I don't know where the performance differences come from.

Contributor

@jreback jreback left a comment

still have yet to review the actual code

@@ -1,7 +1,7 @@
.. _whatsnew_0231:

v0.23.1
-------
v0.23.1 (June 12, 2018)
Contributor

don't modify things, separate PR if you would like to change this

``DataFrame.duplicated`` has gained the ``return_inverse`` kwarg
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Previously, there was no way to determine how duplicate rows in a ``DataFrame`` got mapped to the deduplicated, unique subset. This made it hard to push back
Contributor

too much explanation. make this a sentence or (all of the text). you don't need to explain why

index=[1, 4, 9, 16, 25])
df
isdup, inv = df.duplicated(return_inverse=True) # default: keep='first'
isdup
Contributor

spell out names


.. ipython:: python

    unique = df.loc[~isdup]  # same as df.drop_duplicates()
Contributor

this example is confusing because you are using different code here

Contributor Author

Not sure I understand. It refers to the same objects as in the code block above - it's just a continuation.

@h-vetinari
Contributor Author

h-vetinari commented Jun 29, 2018

As I already mentioned above, it became obvious to me while writing the whatsnew entry that this kwarg should be enabled for Series/Index as well. The last commit contains a working version but no tests yet -- ran into trouble with #21684. Additional xref #21685

This is already much more elegant than the previous commits, because it does all the work for return_inverse in one spot (pandas.core.algorithms), and all the calling objects just need to wrap around that.

In summary, the following is still lacking:

  • tests for Series/Index/MultiIndex
  • asv for Series/Index/MultiIndex
  • update whatsnew

@h-vetinari
Contributor Author

I finally finished adding all the tests; turns out it's quite complicated to do anything for Index, haha. ;-)

The core algorithm changes almost vanish compared to the amount of test changes. Some of the things I added tests for were not really tested so far (bare duplicated was only tested implicitly - and then only sometimes - for stuff like drop_duplicates). I also split up pandas/tests/indexes/multi/test_unique_and_duplicates.test_duplicated, which had a large number of tests in one.

ASVs should be in line with what I posted above, because as long as return_inverse=False, I never calculate anything additional.

I needed to implement a different signature for pandas.core.algorithms.duplicated, which is inspired by np.unique, in the sense that it has both return_index and return_inverse (with the same meanings). This method is not public as far as I can tell? In any case, the docstring reflects the changes.
Finally, I added a stabilize-kwarg, which stabilizes against the implicit sorting that numpy does. It's on by default, but turning it off is a performance improvement in case the index/inverse are immediately plugged together again (as is the case for Series/DataFrame).

Tried to explain all the code as well as possible, but happy to receive feedback (and/or answer questions).

- False : Mark all duplicates as ``True``.
- False : Mark all duplicates as ``True``. This option is not
compatible with ``return_index`` or ``return_inverse``.
return_index : boolean, default False
Contributor

pls pls let's keep the scope to the original. just return_inverse. If you want to do more things, make them PR's on top.

@h-vetinari
Contributor Author

pls pls let's keep the scope to the original. just return_inverse. If you want to do more things, make them PR's on top.

@jreback I did keep strictly with the scope of just adding return_inverse to duplicated. The issue is that Series/DataFrame need a slightly different treatment than Index:

  • the inverse for Series/DataFrame needs to take into account two indexes (the original index, and the one of the deduplicated data). This needs to combine the outputs return_index=True and return_inverse=True from np.unique.
  • the inverse for Index is just a simple np.ndarray, which would in principle just need return_inverse=True from np.unique, except we need to undo the implicit sorting that np.unique does.

There are two options from my POV:

  1. implement two different duplicated-methods for Series/DataFrame vs. Index, which would however overlap if return_inverse=False.
  2. (current state of PR) implement one central duplicated-method, but with enough flexibility to cover both cases, and have thin wrappers around that. Since -- AFAIU -- pandas.core.algorithms.duplicated is not public API, I decided to add that extra flexibility.

Does that make the situation clearer? I don't mind implementing option 1, but it would lead to code duplication and would IMHO be messier.

@h-vetinari
Contributor Author

Ugh, resolving conflicts on windows is such a pain (conflict resolution commits not recognised by git). Had to do a massive rebase...

Also added some examples to the separate duplicated-docstrings.

@h-vetinari
Contributor Author

@WillAyd I incorporated your feedback - any open questions?

@jreback care to take a look?

Member

@jorisvandenbossche jorisvandenbossche left a comment

Didn't look in detail to the implementation, but some general questions:

  • What is the reason you decided to add it to duplicated and not drop_duplicates? (because from quickly looking at the issue, and from the docstring examples, that is actually what you want?)
    I also ask because I think it would be a more natural fit there, as it is not literally an "inverse" for the result of duplicated (as it is for drop_duplicates or as for unique in numpy)

  • I think we should have more discussion (or at least motivation for the current design) about the actual return value (series vs array, index labels vs positional integers). Always returning an array of integers would also be an option (and is eg robust for duplicates in the index)

@@ -159,6 +159,52 @@ This is the same behavior as ``Series.values`` for categorical data. See
:ref:`whatsnew_0240.api_breaking.interval_values` for more.


.. _whatsnew_0240.enhancements.duplicated_inverse:

The `duplicated`-method has gained the `return_inverse` kwarg
Member

double backticks where you now use single backticks (also on the lines below)

(in rst, double backticks give code-styled text)

- False : Mark all duplicates as ``True``. This option is not
compatible with ``return_inverse``.
return_inverse : boolean, default False
If True, also return the selection from the index from the
Member

"selection from the index" sounds strange to me, and it does not really say what the values are (I understand what you mean, but I think we should think about a better wording).
Also the following lines are rather complex to understand (eg the "boolean complement of the first output")



@h-vetinari
Contributor Author

@jorisvandenbossche

What is the reason you decided to add it to duplicated and not drop_duplicates ? (because from quickly looking at the issue, and from the docstring examples, that is actually what you want?)
I also ask because I think it would be a more natural fit there, as it is not literally an "inverse" for the result of duplicated (as it is for drop_duplicates or as for unique in numpy)

I wanted to add .unique method in #21357, but got told by @jreback that:

-1 on adding any new methods, but would be ok with adding a return_inverse kwarg (clearly defaulting to False), to .duplicated() to easily recovering the indexer.

I still think it makes sense for duplicated, which is invoked by drop_duplicates, and so would need to support that anyway if one wanted to add it to drop_duplicates.
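Roughly, and ignoring details like the subset argument, the relationship is (simplified sketch, not the actual pandas source):

def drop_duplicates(df, keep='first'):
    # drop_duplicates is duplicated() plus a boolean mask, so any
    # inverse support there needs the kwarg on duplicated() anyway
    return df.loc[~df.duplicated(keep=keep)]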

I think we should have more discussion (or at least motivation for the current design) about the actual return value (series vs array, index labels vs positional integers). Always returning an array of integers would also be an option (and is eg robust for duplicates in the index)

numpy returns two arrays of indexes, one relative to the original set and one relative to the unique values. I think this information fundamentally belongs together and since pandas supports the index/values distinction, I strongly believe that they should live in the same object. I don't like reusing the index of duplicated itself, because then I need to keep around two separate objects to reconstruct. But happy to discuss alternative approaches.

@h-vetinari
Contributor Author

@jorisvandenbossche
I incorporated your feedback. Since no compelling alternatives have come up for the return type, I left that aspect as-is.

@h-vetinari
Contributor Author

@jreback @WillAyd @jorisvandenbossche

How can I/we move this PR forward?

@jorisvandenbossche
Member

@h-vetinari sorry for the slow answer

I think my main worry is that we are adding a return_inverse keyword which actually does not return the inverse for that function (it does return the inverse for another function), and that it is similar in name to numpy's keyword but different in usage.

Numpy's unique is more comparable to drop_duplicates. Therefore I think such a keyword would be a more natural fit there.

numpy returns two arrays of indexes

Can you clarify this? You are talking about np.unique? (but I think that only returns a single array of indices?)

Side note: it might make sense to add this to pd.unique / Series.unique as well? (not necessarily at the same time; or might actually be an easier starter)

@h-vetinari
Contributor Author

h-vetinari commented Sep 18, 2018

@jorisvandenbossche

I think my main worry is that we are adding a return_inverse keyword which actually does not return the inverse for that function (it does return the inverse for another function), and that it is in name similar to numpy's keyword, but in usage also different.

Fair point. I would prefer a df.unique as well, but was following what I was told in #21357

-1 on adding any new methods, but would be ok with adding a return_inverse kwarg (clearly defaulting to False), to .duplicated() to easily recovering the indexer.

which is how I implemented it here. Could still be added to drop_duplicates as well, but since that depends on duplicated, we'd need it there anyway.

Re:

numpy returns two arrays of indexes

Can you clarify this? You are talking about np.unique? (but I think that only returns a single array of indices?)

https://docs.scipy.org/doc/numpy/reference/generated/numpy.unique.html has both return_index and return_inverse, and both are necessary to reconstruct from the pandas side (the difference for us being that we also need to keep track of the index, whereas numpy just has arange as index).

Side note: it might make sense to add this to pd.unique / Series.unique as well? (not necessarily at the same time; or might actually be an easier starter)

Agreed. Actually, for me the best solution would be:

  • add df.unique (could be a thin wrapper around drop_duplicates)
  • have df.unique, series.unique, index.unique with return_inverse-kwarg
  • dispatch from pd.unique to appropriate code branch

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i haven't looked at this in any detail, there is a lot going on here.


if keep == 'first':
    # o2u: original indices to indices of ARRAY of unique values
    # u2o: reduplication from array of unique values to original array
Contributor

I'd rather you not use np.unique at all; it's not as performant as pd.unique, doesn't handle all dtypes, and sorts.

Further pls, pls use actual names here, and really avoid using abbreviations in any library code.

Contributor Author

@h-vetinari h-vetinari Sep 23, 2018

@jreback

I'd rather you not use np.unique at all, its not as performant as pd.unique, doesn't handle all dtypes, and sorts.

Yes, it would be nicer to have this implemented in the cython hashtable functions, but that performance improvement is for a follow-up. np.unique is an easy solution and is invoked only for return_inverse=True (and we're only calling it on a series of ints, not objects, because the factorization for that hasn't changed!).

Further pls, pls use actual names here, and really avoid using abbreviations in any library code.

There's a fully commented, thoroughly explained and very localized part where these appear. Not sure how this is unclear, but will adapt...

Contributor Author

Finally, the core changes are minimal (but I understand that it looks like a lot). TLDR: Implementation moves to core.algorithms, everything else wraps around that (+ doc improvements). Tests are expanded, and ASVs added.

@jorisvandenbossche
Member

Fair point. I would prefer a df.unique as well, but was following what I was told in #21357

I have wanted a DataFrame.unique for a long time, but the question is how it would be different compared to drop_duplicates?

Anyway, yes @jreback put you on the path to add it to duplicated as a first idea, but I am questioning that choice. So regardless of previous recommendation, we should still discuss that (@jreback can you chime in here?)

numpy.unique has both return_index and return_inverse, and both are necessary to reconstruct from the pandas side

I think numpy's return_index is rather unrelated here? It is about the index into the original array to get the unique ones, while you want to preserve the index of the original array?

Agreed. Actually, for me the best solution would be:

add df.unique (could be a thin wrapper around drop_duplicates)
have df.unique, series.unique, index.unique with return_inverse-kwarg
dispatch from pd.unique to appropriate code branch

Adding a return_inverse to the existing unique function / methods would certainly be a welcome change I think.

@jreback
Contributor

jreback commented Sep 24, 2018

i am not against adding this to .unique, though as you have mentioned that is indeed just drop_duplicates

@h-vetinari
Contributor Author

h-vetinari commented Sep 24, 2018

@jorisvandenbossche
Thanks for your answer:

I have wanted a DataFrame.unique for a long time, but the question is how it would be different compared to drop_duplicates?

I would suggest that unique just acts as a thin wrapper around drop_duplicates. The latter has been in the pandas API forever, but the former name is even more established, and I think they could peacefully coexist.

I think the numpy's return_indices is rather unrelated here? It is about the index into the original array to get the unique ones, while you want to preserve the index of the original array?

Assuming np.unique didn't sort (which I have to laboriously undo in this PR) and keep='first', we would have:

isduplicated = some_series.duplicated()
_, values2unique, unique2values = np.unique(some_series.values,
                                            return_index=True, return_inverse=True)
inverse = Series(some_series.index[values2unique][unique2values], index=some_series.index)

and neither values2unique (the indices) nor unique2values (the inverse) can be omitted. Again, the difference to numpy here is that some_series can have an index different from arange(len(some_series)), which is why we need to first select the right indices of the index (to get the uniques) and then broadcast them back to the full shape to get the full inverse. If we didn't care about the index of some_series, then the indices would indeed not be necessary, but then we wouldn't be able to use the inverse in .loc or .reindex of the uniques (=some_series.loc[~isduplicated]), because the indexes wouldn't match.
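For concreteness, the snippet above run on made-up data, including the round trip that matching indexes enable:

import numpy as np
import pandas as pd

some_series = pd.Series(['b', 'a', 'b', 'c', 'a'], index=[1, 4, 9, 16, 25])
isduplicated = some_series.duplicated()            # [F, F, T, F, T]
_, values2unique, unique2values = np.unique(some_series.values,
                                            return_index=True,
                                            return_inverse=True)
inverse = pd.Series(some_series.index[values2unique][unique2values],
                    index=some_series.index)       # values: [1, 4, 1, 16, 4]
uniques = some_series.loc[~isduplicated]           # labels [1, 4, 16]
# broadcast the uniques back out via .loc, then restore the original index
reconstructed = pd.Series(uniques.loc[inverse.values].values,
                          index=inverse.index)
assert reconstructed.equals(some_series)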

Adding a return_inverse to the existing unique function / methods would certainly be a welcome change I think.

I had already collected some thoughts for this during the day and opened a (largely orthogonal to this PR) issue for this: #22824

EDIT: clarified mechanics for inverse

@jorisvandenbossche
Member

Assuming np.unique didn't sort (which I have to laboriously undo in this PR)

That is because for the inverse you rely on np.unique instead of our own implementations?
Another option might be to actually rely on our own algos for that? (I assume the hashtables already support that, as eg we need it for factorizing. Which led me to think: pd.unique with a return_inverse argument is actually basically the same as pd.factorize?)
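Indeed, pd.factorize already returns that pair for the positional (keep='first', no index) case - a quick illustration:

import numpy as np
import pandas as pd

values = np.array(['a', 'b', 'a', 'c', 'b'], dtype=object)
codes, uniques = pd.factorize(values)
# codes   -> array([0, 1, 0, 2, 1]): a positional inverse for keep='first'
# uniques -> array(['a', 'b', 'c']): in order of first appearance, unsorted
assert (uniques[codes] == values).all()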

I see how you use both return_index and return_inverse there, but that is to construct the inverse (which would be hidden from the user); I thought you were speaking about the actual API for return_inverse towards the user. Hence the confusion.

@h-vetinari
Contributor Author

@jorisvandenbossche

That is because for the inverse you rely on np.unique instead of our own implementations?
Another option might be to actually rely on our own algos for that?
(I assume the hashtables already support that, as eg we need it for factorizing. Which led me think: pd.unique with a return_inverse argument is actually basically the same as pd.factorize ?)

When I started this, I just saw factorize being used to create the IDs which were then fed again into a hashtable. I looked into the hashtables code, but the duplicated_{dtype} methods had no return_inverse and I didn't want to start right away by messing in the templated cython. So I started with np.unique, which is not great (due to having to undo the sorting), but is otherwise easy to start with. The adaptation of the cython code was/would have been a follow-up for me.

Looking at the implementation of factorize, you are right, there is a get_labels of the corresponding table. This is still not immediately usable though, because (AFAICT), it has no way to distinguish between keep='first'|'last' (which is relevant even if there's no index!); and because of the different na-handling of factorize that actually leads to subtle bugs in duplicated, see #21720.

I could work around the former (like I do here for np.unique), but the latter would break the behaviour of Series.unique for mixed NaNs in object columns. And it doesn't seem as easy as setting a different na_value in factorize, as the docstring warns:

na_value : object, optional
        A value in `values` to consider missing. Note: only use this
        parameter when you know that you don't have any values pandas would
        consider missing in the array (NaN for float data, iNaT for
        datetimes, etc.).

Maybe need to add a way for factorize to deal with different NaNs first?

I see how you use both return_index and return_inverse there, but that is to construct the inverse (so which would be hidden for the user), I thought you were speaking about the actual API for return_inverse towards the user. So therefore the confusion.

Just to make sure we're on the same page: I need the return_index of np.unique internally, but do not suggest to expose that in the pandas API. I only used it as an argument that we have an extra layer of information (the index) to keep track of, and why the output should be a Series (containing, in some transformed way, both the return_index and return_inverse from np.unique).

@h-vetinari
Contributor Author

Alright, I'm closing this for now, as the lack of cython-inverse has now been solved by #23400, but for .unique rather than .duplicated. Therefore I'll follow that line of development (it's also closer to the intention of #21357).
